1 Data input

Let’s start by reading the computed metrics for all projects.

## [1] TRUE
## 'data.frame':    2835 obs. of  22 variables:
##  $ project             : Factor w/ 13 levels "black","cookiecutter",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ bug_number          : int  1 3 4 6 7 8 10 11 14 15 ...
##  $ granularity         : Factor w/ 3 levels "function","statement",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ technique           : Factor w/ 7 levels "DStar","Metallaxis",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ crashing            : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ predicate           : logi  FALSE FALSE FALSE TRUE TRUE TRUE ...
##  $ ismutable           : logi  FALSE FALSE FALSE TRUE TRUE TRUE ...
##  $ mutability          : num  0 0 0 0.112 0.119 ...
##  $ time                : num  132.4 104.2 68.5 58.8 64.8 ...
##  $ einspect            : num  4 100.5 39.5 11 29 ...
##  $ is_bug_localized    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ exam                : num  0.0099 0.3018 0.1282 0.0364 0.0967 ...
##  $ java_exam_score     : num  0.0099 0.3018 0.1282 0.0364 0.0967 ...
##  $ output_length       : int  265 197 188 182 180 180 175 174 166 171 ...
##  $ cdist               : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ svcomp              : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ cumulative_distance2: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ minutes             : num  2.21 1.74 1.14 0.98 1.08 ...
##  $ logtime             : num  4.89 4.65 4.23 4.07 4.17 ...
##  $ family              : Factor w/ 4 levels "MBFL","PS","ST",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ category            : Factor w/ 4 levels "CL","DEV","DS",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ bugid               : Factor w/ 135 levels "black1","black10",..: 1 9 10 11 12 13 2 3 4 5 ...

We have data about 135 bugs in 13 analyzed projects.
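
The `minutes` and `logtime` columns in the listing above look like simple transformations of `time` (in seconds). The sketch below reproduces the first three rows under that assumption; the derivations themselves are not shown in this report, so treat them as a plausible reconstruction rather than the actual preprocessing code.

```r
# Toy data frame holding the first three `time` values from the listing above.
datas <- data.frame(time = c(132.4, 104.2, 68.5))  # running time in seconds

# Assumed derivations (not shown in the source):
datas$minutes <- datas$time / 60    # seconds to minutes
datas$logtime <- log(datas$time)    # natural logarithm of the seconds

print(round(datas, 2))
```

The rounded values agree with the `minutes` and `logtime` entries printed for the same rows, which supports (but does not prove) the assumed derivations.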

2 Pairwise comparisons

Let’s see an example of visual and statistical comparison of two groups of experiments for the same bugs.

To make the example concrete, let’s pick two groups and compare their \(E_{\text{inspect}}\) scores on statement-level fault localization:

  • \(S\) are the experiments done with any SBFL technique
  • \(M\) are the experiments done with any MBFL technique

Since there are three experiments per bug using SBFL, but only two experiments per bug using MBFL, we’ll aggregate scores for the same bug by average.
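
As a minimal sketch of this aggregation step, the following base R code averages scores per bug and family and then builds two paired vectors. The data frame `runs` and all its values are made up for illustration; only the column names `bugid`, `family`, and `einspect` come from the data listing in Section 1.

```r
# Made-up scores: three SBFL runs and two MBFL runs for each of two bugs.
runs <- data.frame(
  bugid    = c("b1", "b1", "b1", "b1", "b1", "b2", "b2", "b2", "b2", "b2"),
  family   = c("SBFL", "SBFL", "SBFL", "MBFL", "MBFL",
               "SBFL", "SBFL", "SBFL", "MBFL", "MBFL"),
  einspect = c(3, 4, 5, 10, 20, 1, 1, 4, 2, 6)
)

# One row per (bug, family) pair, holding the mean score.
avg <- aggregate(einspect ~ bugid + family, data = runs, FUN = mean)

# Paired vectors S and M, aligned by bug identifier.
S <- with(subset(avg, family == "SBFL"), setNames(einspect, bugid))
M <- with(subset(avg, family == "MBFL"), setNames(einspect, bugid))
M <- M[names(S)]

print(rbind(S, M))
```

Aligning `M` to the names of `S` matters because the paired statistics used later assume that position `i` in both vectors refers to the same bug.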

Let’s start with some visualization: a scatterplot with a point for each bug; each point has coordinates \(x, y\) where \(x\) is its score in MBFL and \(y\) its score in SBFL.

As you can see, there is a bulk of bugs for which SBFL performs very similarly to MBFL (points close to the \(x = y\) straight line). However, for several other bugs, SBFL is much better (remember that lower is better for this score).

Looking at the colors, we notice that bugs in the CL (and possibly DS) category are overrepresented among the “harder” bugs on which SBFL behaves much better than MBFL.

Analyzing the same data numerically, we can compute the correlation (Kendall’s \(\tau\)) between \(S\) and \(M\):

## 
##  Kendall's rank correlation tau
## 
## data:  S and M
## z = 7.8047, p-value = 5.965e-15
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##       tau 
## 0.5403952

A correlation of about 0.54 is moderate: not very strong, but clearly present (the p-value is far below any conventional threshold).
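
For reference, a test of this kind can be run with base R’s `cor.test`. The vectors below are toy data, not the scores from this study; they are chosen without ties so that the value of \(\tau\) is easy to verify by counting concordant and discordant pairs.

```r
# Toy paired scores (illustrative only).
S <- c(1, 2, 3, 4, 5)
M <- c(2, 1, 4, 3, 5)

# Kendall's rank correlation between the two paired samples.
kt  <- cor.test(S, M, method = "kendall")
tau <- unname(kt$estimate)

# Of the 10 pairs of observations, 8 are concordant and 2 are discordant,
# so tau = (8 - 2) / 10.
print(tau)
```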

Finally, we may also perform a statistical test (Wilcoxon’s paired test) and compute a matching effect size (Cliff’s delta).

## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  S and M
## V = 1068, p-value = 0.005209
## alternative hypothesis: true location shift is not equal to 0
## 
## Cliff's Delta
## 
## delta estimate: -0.1761866 (small)
## 95 percent confidence interval:
##       lower       upper 
## -0.29548868 -0.05147369

Cliff’s delta, in particular, measures how often the values in one set are larger than the values in the other set, minus how often they are smaller. Thus, the given value of about \(-0.18\) means that, over all pairs of scores, SBFL’s \(E_{\text{inspect}}\) score is smaller than MBFL’s about 18 percentage points more often than it is larger.
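
To make the definition concrete, here is a hand-rolled sketch of Cliff’s delta in base R, next to a paired Wilcoxon test on the same toy data. The function `cliffs_delta` is a hypothetical helper written for this illustration; it is not the routine that produced the output above, and the numbers are made up.

```r
# Cliff's delta: (number of pairs with x > y minus number of pairs with x < y),
# divided by the total number of pairs.
cliffs_delta <- function(x, y) {
  d <- outer(x, y, "-")   # all pairwise differences x[i] - y[j]
  mean(sign(d))
}

# Toy paired scores (illustrative only).
S <- c(1, 4, 2, 8)
M <- c(3, 5, 2.5, 6)

delta <- cliffs_delta(S, M)
print(delta)

# Paired Wilcoxon signed rank test; exact = FALSE requests the normal
# approximation, which is what R falls back to when differences are tied.
w <- wilcox.test(S, M, paired = TRUE, exact = FALSE)
print(w$p.value)
```

In this toy example 6 of the 16 pairs have `S` larger and 10 have `S` smaller, so delta is \((6 - 10)/16 = -0.25\): a negative value indicates that `S` tends to be the smaller of the two.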

These statistics, for what they’re worth, seem to confirm that there is a noticeable difference in favor of SBFL.

Now, let’s generalize this to a scatterplot matrix to show the relations between all possible pairs of FL families.

First, we define a bunch of helper functions.

Then, we use them to generate plots for \(E_{\text{inspect}}\).

Now, it’s easy to compute a similar plot for other metrics. For example, running time (in minutes):

And also for the individual techniques in the SBFL and MBFL families (the families that include more than one technique).

3 Regression models

Let’s build a simple multivariate regression model, where we predict einspect and logtime from the family of the FL technique and the category of the project.

Notice that we found it preferable to log-transform time (in seconds), since this helps with the wide range of running times among techniques. In particular, ST runs in a matter of seconds, two orders of magnitude faster than the next fastest family, SBFL. If we do not log-transform time, we still get generally sensible results, but the advantage of ST over SBFL becomes watered down and less clear. Thus, we stick with the log-transformed time.

Before proceeding with fitting, we standardize both outcomes (einspectS and timeS in the models below), so that it’s much easier to set sensible priors.
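
The exact transformation is not shown in this report. A common choice, assumed here, is the z-score: subtract the mean and divide by the standard deviation. The helper `standardize` below is hypothetical, and the input values are just the first five `einspect` entries from the listing in Section 1.

```r
# Assumed standardization: center to mean 0 and scale to standard deviation 1.
standardize <- function(x) (x - mean(x)) / sd(x)

einspect  <- c(4, 100.5, 39.5, 11, 29)  # first values from the data listing
einspectS <- standardize(einspect)

print(round(einspectS, 3))
print(c(mean = mean(einspectS), sd = sd(einspectS)))
```

On this scale a prior such as `normal(0, 1)` has a direct reading: one unit corresponds to one standard deviation of the variable, regardless of its original units.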

3.1 Model \(m_1\): baseline multivariate regression

Here’s a basic regression model, where the only unusual aspects are that it’s multivariate and that it uses a log link function.

eq.m1 <- brmsformula(
  mvbind(einspectS, timeS) ~ 0 + family + category,
  family=brmsfamily("gaussian", link="log")
) + set_rescor(TRUE)

pp1.check <- get_prior(eq.m1, data=by.statement)

pp1 <- c(
  set_prior("normal(0, 1.0)", class="b", resp=c("einspectS", "timeS")),
  set_prior("weibull(2, 1)", class="sigma", resp=c("einspectS", "timeS"))
)
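
Written out, the formula and priors above correspond roughly to the following model, where the superscripts \(E\) and \(T\) mark the einspectS and timeS parts. The symbols \(\alpha\), \(\beta\), and \(\Sigma\) are notation introduced here for illustration; \(\Sigma\) is the residual covariance built from \(\sigma^{E}\), \(\sigma^{T}\), and the residual correlation enabled by `set_rescor(TRUE)`.

\[
\begin{aligned}
\begin{pmatrix} \mathit{einspectS}_i \\ \mathit{timeS}_i \end{pmatrix}
  &\sim \text{MVNormal}\left( \begin{pmatrix} \mu^{E}_i \\ \mu^{T}_i \end{pmatrix}, \Sigma \right) \\
\log \mu^{E}_i &= \alpha^{E}_{\text{family}[i]} + \beta^{E}_{\text{category}[i]} \\
\log \mu^{T}_i &= \alpha^{T}_{\text{family}[i]} + \beta^{T}_{\text{category}[i]} \\
\alpha^{E}, \alpha^{T}, \beta^{E}, \beta^{T} &\sim \text{Normal}(0, 1) \\
\sigma^{E}, \sigma^{T} &\sim \text{Weibull}(2, 1)
\end{aligned}
\]

Because the formula starts with `0 +`, every family gets its own coefficient \(\alpha\), whereas the first category level acts as the reference, with its \(\beta\) fixed at zero.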

3.2 Fitting \(m_1\)

Let’s do the usual checks to make sure that everything is fine with the fitting.

Prior checks confirm that the sampled priors span a wide range of values, amply including the data.
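
The idea behind such a check can be sketched in base R by simulating from the priors of \(m_1\) alone, without any data. This is only an illustration of the principle for one outcome and one family/category combination; with brms one would more typically refit the model with `sample_prior = "only"`. The variable names below are hypothetical.

```r
set.seed(1)
n_draws <- 1000

# Draws from the priors declared for m1.
b_family   <- rnorm(n_draws, mean = 0, sd = 1)        # normal(0, 1.0) on class "b"
b_category <- rnorm(n_draws, mean = 0, sd = 1)        # normal(0, 1.0) on class "b"
sigma      <- rweibull(n_draws, shape = 2, scale = 1) # weibull(2, 1) on sigma

# Log link: the mean is the exponential of the linear predictor.
mu <- exp(b_family + b_category)

# One simulated (standardized) outcome per prior draw.
y_sim <- rnorm(n_draws, mean = mu, sd = sigma)

print(quantile(y_sim, c(0.025, 0.5, 0.975)))
```

If the bulk of the observed standardized outcomes falls well inside the range of `y_sim`, the priors are not ruling out plausible data.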

Now we fit the actual model.

## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
## 
## Chain 1 finished in 10.1 seconds.
## Chain 3 finished in 10.0 seconds.
## Chain 4 finished in 10.1 seconds.
## Chain 2 finished in 10.9 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 10.3 seconds.
## Total execution time: 11.2 seconds.

Next, we check the usual diagnostics:

  • No (or at most a few) divergent transitions
  • \(\widehat{R}\) ratio below \(1.01\)
  • Effective sample size (ESS), as a ratio of the total sample size, at least 10%
## [1] 0
## [1] 1.003669
## [1] 0.3695669
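
To illustrate what the \(\widehat{R}\) criterion measures, here is a simplified split-\(\widehat{R}\) computed by hand on simulated chains. It is the classic (not rank-normalized) variant, so it will not match the values reported by the fitting tools exactly; `split_rhat` is a hypothetical helper. The last lines assume that the three numbers printed above correspond, in order, to the three criteria in the list.

```r
# Classic split R-hat for one parameter; `draws` has one column per chain.
split_rhat <- function(draws) {
  n <- floor(nrow(draws) / 2)
  # Split every chain into a first and a second half.
  halves <- cbind(draws[1:n, , drop = FALSE],
                  draws[(nrow(draws) - n + 1):nrow(draws), , drop = FALSE])
  W <- mean(apply(halves, 2, var))     # within-chain variance
  B <- n * var(colMeans(halves))       # between-chain variance
  var_plus <- (n - 1) / n * W + B / n  # pooled variance estimate
  sqrt(var_plus / W)
}

set.seed(42)
good <- matrix(rnorm(4000), ncol = 4)  # four well-mixed chains
bad  <- good
bad[, 1] <- bad[, 1] + 3               # one chain stuck elsewhere

rhat_good <- split_rhat(good)
rhat_bad  <- split_rhat(bad)
print(c(good = rhat_good, bad = rhat_bad))

# Thresholds from the list above, applied to the reported values
# (assumed order: divergences, maximum R-hat, minimum ESS ratio).
reported <- list(divergent = 0, max_rhat = 1.003669, min_ess_ratio = 0.3695669)
ok <- with(reported, divergent == 0 && max_rhat < 1.01 && min_ess_ratio >= 0.10)
print(ok)
```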

Finally, we check the posteriors, to ensure that we have a decent approximation of the data.

As you can see, the simulated posteriors are decent given that the data is complex, whereas the model is quite simplistic (we’ll improve it soon).

3.3 Analyzing \(m_1\)

##  Family: MV(gaussian, gaussian) 
##   Links: mu = log; sigma = identity
##          mu = log; sigma = identity 
## Formula: einspectS ~ 0 + family + category 
##          timeS ~ 0 + family + category 
##    Data: by.statement (Number of observations: 945) 
##   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 4000
## 
## Population-Level Effects: 
##                       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
## einspectS_familyMBFL     -2.96      0.49    -3.98    -2.09 1.00     6679
## einspectS_familyPS        0.43      0.08     0.27     0.58 1.00     5758
## einspectS_familyST        0.54      0.08     0.37     0.69 1.00     4489
## einspectS_familySBFL     -3.39      0.47    -4.41    -2.56 1.00     5321
## einspectS_categoryDEV    -2.52      0.49    -3.63    -1.71 1.00     4888
## einspectS_categoryDS     -0.68      0.13    -0.92    -0.44 1.00     5069
## einspectS_categoryWEB    -2.29      0.47    -3.34    -1.50 1.00     5021
## timeS_familyMBFL         -0.60      0.14    -0.90    -0.35 1.00     1671
## timeS_familyPS           -1.26      0.18    -1.62    -0.93 1.00     2571
## timeS_familyST           -4.60      0.45    -5.55    -3.79 1.00     5776
## timeS_familySBFL         -4.12      0.44    -5.08    -3.34 1.00     5346
## timeS_categoryDEV         0.64      0.17     0.32     0.98 1.00     1960
## timeS_categoryDS          0.86      0.15     0.58     1.18 1.00     1726
## timeS_categoryWEB         0.33      0.21    -0.10     0.73 1.00     2433
##                       Tail_ESS
## einspectS_familyMBFL      2906
## einspectS_familyPS        3133
## einspectS_familyST        3083
## einspectS_familySBFL      2467
## einspectS_categoryDEV     2175
## einspectS_categoryDS      3115
## einspectS_categoryWEB     2318
## timeS_familyMBFL          2202
## timeS_familyPS            3098
## timeS_familyST            2816
## timeS_familySBFL          2223
## timeS_categoryDEV         2613
## timeS_categoryDS          2312
## timeS_categoryWEB         3071
## 
## Family Specific Parameters: 
##                 Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma_einspectS     0.86      0.02     0.82     0.90 1.00     6494     3213
## sigma_timeS         0.85      0.02     0.81     0.88 1.00     5999     2868
## 
## Residual Correlations: 
##                         Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
## rescor(einspectS,timeS)     0.10      0.04     0.03     0.18 1.00     5418
##                         Tail_ESS
## rescor(einspectS,timeS)     3301
## 
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

What’s noticeable here is that the residual correlation between the two outcomes einspect and time is small (about 0.10), which means that there is not much of a consistent dependency between these two variables.

Let’s set up some functions to analyze the posterior samples of m1 (and similar models).
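
The helper functions themselves are not shown in this report. As a rough sketch of what such a summary could look like, the function below builds a table of nested intervals from a matrix of posterior draws, using the same row labels as the output that follows (`|0.95` for the lower end and `0.95|` for the upper end of the 95% interval). It assumes equal-tailed quantile intervals, which may differ from the intervals actually used, and it runs on simulated draws rather than on the fitted model.

```r
# Nested equal-tailed intervals; `draws` has one named column per group.
interval_table <- function(draws, probs = c(0.5, 0.7, 0.9, 0.95, 0.99)) {
  bound <- function(q) apply(draws, 2, quantile, probs = q, names = FALSE)
  lower <- do.call(rbind, lapply(probs, function(p) bound((1 - p) / 2)))
  upper <- do.call(rbind, lapply(rev(probs), function(p) bound((1 + p) / 2)))
  rownames(lower) <- paste0("|", probs)
  rownames(upper) <- paste0(rev(probs), "|")
  rbind(lower, upper)
}

# Simulated draws for two groups (illustrative values only).
set.seed(7)
draws <- matrix(rnorm(8000, mean = rep(c(-2, 1.5), each = 4000)),
                ncol = 2, dimnames = list(NULL, c("MBFL", "PS")))

tab <- interval_table(draws)
print(round(tab, 3))
```

An effect is then read as clearly negative (or positive) at a given level when both ends of the corresponding interval lie on the same side of zero.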

Let’s use these functions to first analyze the effects per family of FL techniques.

## $ints
##             MBFL       PS       SBFL       ST
## |0.5  -1.8353102 1.739195 -2.3158802 1.827165
## |0.7  -2.0553602 1.711112 -2.4126302 1.804245
## |0.9  -2.3861902 1.654794 -2.8586302 1.751850
## |0.95 -2.5145002 1.624174 -2.9680702 1.720207
## |0.99 -2.9997502 1.549535 -3.2901202 1.673284
## 0.99| -0.5121802 1.952189 -0.9237502 2.096772
## 0.95| -0.6458902 1.927664 -1.1574802 2.037926
## 0.9|  -0.8250402 1.910703 -1.3378402 2.020518
## 0.7|  -1.0481902 1.869024 -1.4724602 1.972580
## 0.5|  -1.1912202 1.842191 -1.7029002 1.937703
## 
## $est
## NULL
## $ints
##           MBFL        PS       SBFL         ST
## |0.5  1.960318 1.2660575 -1.6757025 -2.1966025
## |0.7  1.939706 1.2091175 -1.8622325 -2.3645725
## |0.9  1.830394 1.0957975 -2.1541725 -2.6666025
## |0.95 1.757469 1.0383075 -2.3414025 -2.9301925
## |0.99 1.681878 0.9058275 -2.7163425 -3.1681425
## 0.99| 2.409543 1.8271925 -0.4405425 -0.8711025
## 0.95| 2.307460 1.7190265 -0.6362325 -1.1655825
## 0.9|  2.284562 1.6675425 -0.7555525 -1.2291625
## 0.7|  2.225164 1.5814775 -0.9866125 -1.4500425
## 0.5|  2.150519 1.5101775 -1.1078825 -1.6046025
## 
## $est
## NULL

For both outcomes, einspect and time, there are clear differences (with high probability) in the contribution over the mean from different families of techniques.

Looking at the effects by category of project does not yield differences that are as strong, but we can see that DS projects tend to be associated with worse (higher) einspect.

## $ints
##               DEV        DS         WEB
## |0.5  -0.91667549 1.0764985 -0.62714549
## |0.7  -1.05566549 1.0244915 -0.86156549
## |0.9  -1.45905549 0.9409025 -1.16507549
## |0.95 -1.68509549 0.9035745 -1.39217549
## |0.99 -2.07679549 0.8313065 -1.80509549
## 0.99|  0.35950451 1.4586825  0.51750451
## 0.95|  0.18894451 1.3829805  0.39730451
## 0.9|   0.09513451 1.3503185  0.33071451
## 0.7|  -0.10926549 1.2855685  0.09100451
## 0.5|  -0.28688549 1.2429085 -0.01610549
## 
## $est
## NULL
## $ints
##               DEV           DS         WEB
## |0.5  -0.08602014  0.150728859 -0.40916514
## |0.7  -0.14248514  0.087022859 -0.47240714
## |0.9  -0.24966114 -0.004000141 -0.62202694
## |0.95 -0.29821414 -0.026865141 -0.69047224
## |0.99 -0.37835714 -0.119299141 -0.87612214
## 0.99|  0.48681286  0.667462859  0.22649886
## 0.95|  0.35430886  0.565192859  0.14131586
## 0.9|   0.28288386  0.487022859  0.07624786
## 0.7|   0.19699386  0.395402859 -0.04057914
## 0.5|   0.13734386  0.352488859 -0.13661914
## 
## $est
## NULL

3.4 Model \(m_2\): multivariate varying effects

Let’s make the model more sophisticated by adding varying effects and modeling these effects as possibly correlated (which makes sense, since the model has two parts, one per outcome).

eq.m2 <- brmsformula(
  mvbind(einspectS, timeS) ~ 1 + (1|p|family) + (1|q|category),
  family=brmsfamily("gaussian", link="log")
) + set_rescor(TRUE)

pp2.check <- get_prior(eq.m2, data=by.statement)

pp2 <- c(
  set_prior("normal(0, 1.0)", class="Intercept", resp=c("einspectS", "timeS")),
  set_prior("weibull(2, 0.3)", class="sd", coef="Intercept", 
            group="family", resp=c("einspectS", "timeS")),
  set_prior("weibull(2, 0.3)", class="sd", coef="Intercept", 
            group="category", resp=c("einspectS", "timeS")),
  set_prior("gamma(0.01, 0.01)", class="sigma", resp=c("einspectS", "timeS"))
)
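
In mathematical notation, the linear predictors and priors of \(m_2\) can be sketched as follows, for each outcome \(k \in \{E, T\}\) (einspectS and timeS). The symbols \(\alpha\), \(u\), \(v\), \(\Sigma_u\), and \(\Sigma_v\) are notation introduced here for illustration; the residual part is the same as in \(m_1\).

\[
\begin{aligned}
\log \mu^{k}_i &= \alpha^{k} + u^{k}_{\text{family}[i]} + v^{k}_{\text{category}[i]} \\
\begin{pmatrix} u^{E}_f \\ u^{T}_f \end{pmatrix} &\sim \text{MVNormal}(0, \Sigma_u)
\qquad
\begin{pmatrix} v^{E}_c \\ v^{T}_c \end{pmatrix} \sim \text{MVNormal}(0, \Sigma_v) \\
\alpha^{k} &\sim \text{Normal}(0, 1) \\
\text{standard deviations in } \Sigma_u, \Sigma_v &\sim \text{Weibull}(2, 0.3) \\
\sigma^{k} &\sim \text{Gamma}(0.01, 0.01)
\end{aligned}
\]

The identifiers `p` and `q` in the formula are what tie the two model parts together: terms sharing the same identifier are modeled with a joint covariance, so that \(\Sigma_u\) and \(\Sigma_v\) each include a correlation between the einspectS and timeS intercepts.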

3.5 Fitting \(m_2\)

Let’s fit \(m_2\) and check the fit.

Prior checks:

We fit model \(m_2\).

## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
## 
## Chain 2 finished in 68.7 seconds.
## Chain 4 finished in 69.2 seconds.
## Chain 1 finished in 71.5 seconds.
## Chain 3 finished in 73.2 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 70.7 seconds.
## Total execution time: 73.4 seconds.

Diagnostics:

## [1] 0
## [1] 1.003221
## [1] 0.3363845

Posterior checks:

We can perhaps glean a small improvement compared to \(m_1\). Let’s compare the two models using LOO.

## Output of model 'm1':
## 
## Computed from 4000 by 945 log-likelihood matrix
## 
##          Estimate    SE
## elpd_loo  -2386.0  63.7
## p_loo        28.2   3.9
## looic      4772.0 127.3
## ------
## Monte Carlo SE of elpd_loo is 0.1.
## 
## All Pareto k estimates are good (k < 0.5).
## See help('pareto-k-diagnostic') for details.
## 
## Output of model 'm2':
## 
## Computed from 4000 by 945 log-likelihood matrix
## 
##          Estimate    SE
## elpd_loo  -2383.1  63.9
## p_loo        28.7   4.0
## looic      4766.1 127.7
## ------
## Monte Carlo SE of elpd_loo is 0.1.
## 
## All Pareto k estimates are good (k < 0.5).
## See help('pareto-k-diagnostic') for details.
## 
## Model comparisons:
##    elpd_diff se_diff
## m2  0.0       0.0   
## m1 -2.9       1.1

\(m_1\)’s score is about 2.6 standard errors worse than \(m_2\)’s (an elpd difference of 2.9 with a standard error of 1.1), which is a significant difference in favor of \(m_2\) in terms of predictive capabilities.
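
The arithmetic behind this comparison is simple enough to check by hand. The sketch below uses only numbers copied from the LOO output above; the variable names are hypothetical.

```r
# Values copied from the LOO output above.
elpd_m1 <- -2386.0
elpd_m2 <- -2383.1
se_diff <- 1.1          # standard error of the elpd difference

# Difference of m1 relative to the best model (m2).
elpd_diff <- elpd_m1 - elpd_m2

# How many standard errors separate the two models.
ratio <- abs(elpd_diff) / se_diff

print(round(elpd_diff, 1))
print(round(ratio, 2))
```

Note that `se_diff` is computed from the pointwise differences between the two models, which is why it is much smaller than the standard errors of the individual elpd estimates (around 64).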

3.6 Analyzing \(m_2\)

##  Family: MV(gaussian, gaussian) 
##   Links: mu = log; sigma = identity
##          mu = log; sigma = identity 
## Formula: einspectS ~ 1 + (1 | p | family) + (1 | q | category) 
##          timeS ~ 1 + (1 | p | family) + (1 | q | category) 
##    Data: by.statement (Number of observations: 945) 
##   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 4000
## 
## Group-Level Effects: 
## ~category (Number of levels: 4) 
##                                          Estimate Est.Error l-95% CI u-95% CI
## sd(einspectS_Intercept)                      0.65      0.12     0.44     0.90
## sd(timeS_Intercept)                          0.39      0.10     0.22     0.63
## cor(einspectS_Intercept,timeS_Intercept)    -0.21      0.27    -0.68     0.33
##                                          Rhat Bulk_ESS Tail_ESS
## sd(einspectS_Intercept)                  1.00     4409     3005
## sd(timeS_Intercept)                      1.00     3590     2960
## cor(einspectS_Intercept,timeS_Intercept) 1.00     3043     2677
## 
## ~family (Number of levels: 4) 
##                                          Estimate Est.Error l-95% CI u-95% CI
## sd(einspectS_Intercept)                      0.91      0.12     0.69     1.17
## sd(timeS_Intercept)                          0.92      0.12     0.69     1.18
## cor(einspectS_Intercept,timeS_Intercept)    -0.04      0.18    -0.37     0.31
##                                          Rhat Bulk_ESS Tail_ESS
## sd(einspectS_Intercept)                  1.00     3766     2761
## sd(timeS_Intercept)                      1.00     4365     2681
## cor(einspectS_Intercept,timeS_Intercept) 1.00     3364     2659
## 
## Population-Level Effects: 
##                     Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## einspectS_Intercept    -2.08      0.53    -3.10    -1.06 1.00     1975     1927
## timeS_Intercept        -1.97      0.48    -2.90    -1.03 1.00     2581     2667
## 
## Family Specific Parameters: 
##                 Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma_einspectS     0.86      0.02     0.82     0.90 1.00     6091     2669
## sigma_timeS         0.84      0.02     0.80     0.88 1.00     5068     2374
## 
## Residual Correlations: 
##                         Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
## rescor(einspectS,timeS)     0.10      0.04     0.02     0.17 1.00     5481
##                         Tail_ESS
## rescor(einspectS,timeS)     2502
## 
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

By category, there is a slight inverse correlation between the two outcomes einspect and time (although its 95% interval includes zero); this correlation disappears if we look at the family terms. The residual correlation is the same as in \(m_1\).

Let’s now perform an effects analysis on the fitted coefficients of m2. First we introduce a summary function suitable for varying effects models.

Then we use the summary function to analyze the effects of the FL techniques.

## $ints
##             MBFL        PS        ST       SBFL
## |0.5  -2.5262775 1.1176850 1.2182375 -2.8801300
## |0.7  -2.7709055 0.9477579 1.0616545 -3.1258475
## |0.9  -3.2180915 0.6775959 0.7788652 -3.5455925
## |0.95 -3.5103197 0.5359255 0.6447676 -3.7660067
## |0.99 -4.0641862 0.2329609 0.3315840 -4.2131177
## 0.99| -0.7022095 2.5920590 2.7000748 -0.9786326
## 0.95| -1.0036095 2.3085197 2.4006195 -1.3338787
## 0.9|  -1.1621255 2.1533480 2.2573620 -1.5147860
## 0.7|  -1.4888745 1.8624790 1.9778870 -1.8374510
## 0.5|  -1.6824225 1.7035350 1.8101250 -2.0364775
## 
## $est
##      MBFL        PS        ST      SBFL 
## -2.132916  1.411442  1.518747 -2.473152
## $ints
##            MBFL         PS        ST       SBFL
## |0.5  1.5663375  0.8773568 -2.946875 -2.4867300
## |0.7  1.4073460  0.7103954 -3.209426 -2.7332090
## |0.9  1.1276775  0.4251590 -3.687169 -3.1826010
## |0.95 0.9677368  0.2451090 -3.925625 -3.4541217
## |0.99 0.6264373 -0.0473359 -4.352294 -4.0203258
## 0.99| 2.9782522  2.3132561 -1.073653 -0.5231171
## 0.95| 2.7062393  2.0606435 -1.438210 -0.8640560
## 0.9|  2.5780990  1.9228000 -1.598879 -1.0378780
## 0.7|  2.3357665  1.6489455 -1.930957 -1.4056300
## 0.5|  2.1721100  1.4788225 -2.124840 -1.6318900
## 
## $est
##      MBFL        PS        ST      SBFL 
##  1.862529  1.174628 -2.564445 -2.069716

The results are generally consistent with those of model \(m_1\), although some effects slightly weaken or strengthen.

Let’s see what happens for the categories of projects.

## $ints
##              CL        DEV          DS         WEB
## |0.5  0.8576515 -1.5181075  0.16223450 -1.33094250
## |0.7  0.7378781 -1.6876235  0.04688615 -1.50825150
## |0.9  0.5309405 -2.0290295 -0.18046820 -1.89460300
## |0.95 0.4075303 -2.2206682 -0.31008255 -2.12356500
## |0.99 0.1732455 -2.6481052 -0.55104856 -2.55194350
## 0.99| 2.0725035 -0.1770082  1.36204860  0.04386895
## 0.95| 1.7999425 -0.4056425  1.12326375 -0.19342832
## 0.9|  1.6730295 -0.5240378  0.99301000 -0.31669545
## 0.7|  1.4488815 -0.7515989  0.75661670 -0.56242150
## 0.5|  1.3150325 -0.9015535  0.62102200 -0.70125675
## 
## $est
##         CL        DEV         DS        WEB 
##  1.0899108 -1.2253783  0.3930164 -1.0401307
## $ints
##               CL         DEV          DS         WEB
## |0.5  -0.8226835  0.03591773  0.22997525 -0.23261775
## |0.7  -0.9169421 -0.05039076  0.15723225 -0.32312575
## |0.9  -1.0974810 -0.19879025  0.02132240 -0.48009930
## |0.95 -1.1933780 -0.28385260 -0.06325879 -0.57814627
## |0.99 -1.4628219 -0.48788514 -0.22869648 -0.75663510
## 0.99| -0.1105753  0.73849182  0.90394435  0.51016309
## 0.95| -0.2154603  0.59909808  0.76945853  0.35718410
## 0.9|  -0.2763427  0.52126265  0.71423625  0.28251385
## 0.7|  -0.4095806  0.39238235  0.57253150  0.14601540
## 0.5|  -0.4943050  0.31617325  0.49809975  0.06830863
## 
## $est
##          CL         DEV          DS         WEB 
## -0.66390180  0.17212172  0.36407508 -0.08639932

Here we see some differences, which may partly be due to the fact that \(m_2\) models the different categories more uniformly. Furthermore, some changes may simply mean that the per-category effects are small, and hence likely to fluctuate with inconsequential changes to the model.

3.7 Model \(m_3\): interactions

Now, let’s try a variant of \(m_2\) where we go back to fixed intercepts but add an interaction term between family of FL techniques and category of projects.

eq.m3 <- brmsformula(
  mvbind(einspectS, timeS) ~ 
    0 + family + category + (0 + family|r|category),
  family=brmsfamily("gaussian", link="log")
) + set_rescor(TRUE)

pp3.check <- get_prior(eq.m3, data=by.statement)

pp3 <- c(
  set_prior("normal(0, 1.0)", class="b", resp=c("einspectS", "timeS")),
  set_prior("gamma(0.01, 0.01)", class="sigma", resp=c("einspectS", "timeS")),
  set_prior("lkj(1)", class="cor"),
  set_prior("weibull(2, 0.3)", class="sd", resp=c("einspectS", "timeS"))
)

3.8 Fitting \(m_3\)

Let’s fit \(m_3\) and check the fit.

Prior checks:

We fit model \(m_3\).

## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
## 
## Chain 2 finished in 106.9 seconds.
## Chain 4 finished in 108.5 seconds.
## Chain 3 finished in 110.1 seconds.
## Chain 1 finished in 113.4 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 109.7 seconds.
## Total execution time: 113.6 seconds.

Diagnostics:

## [1] 0
## [1] 1.003169
## [1] 0.2236839

Posterior checks:

This is in line with what we have seen before, possibly a bit better.

Model comparison:

## Output of model 'm1':
## 
## Computed from 4000 by 945 log-likelihood matrix
## 
##          Estimate    SE
## elpd_loo  -2386.0  63.7
## p_loo        28.2   3.9
## looic      4772.0 127.3
## ------
## Monte Carlo SE of elpd_loo is 0.1.
## 
## All Pareto k estimates are good (k < 0.5).
## See help('pareto-k-diagnostic') for details.
## 
## Output of model 'm2':
## 
## Computed from 4000 by 945 log-likelihood matrix
## 
##          Estimate    SE
## elpd_loo  -2383.1  63.9
## p_loo        28.7   4.0
## looic      4766.1 127.7
## ------
## Monte Carlo SE of elpd_loo is 0.1.
## 
## All Pareto k estimates are good (k < 0.5).
## See help('pareto-k-diagnostic') for details.
## 
## Output of model 'm3':
## 
## Computed from 4000 by 945 log-likelihood matrix
## 
##          Estimate    SE
## elpd_loo  -2379.3  64.3
## p_loo        32.1   4.2
## looic      4758.6 128.5
## ------
## Monte Carlo SE of elpd_loo is 0.1.
## 
## Pareto k diagnostic values:
##                          Count Pct.    Min. n_eff
## (-Inf, 0.5]   (good)     944   99.9%   1049      
##  (0.5, 0.7]   (ok)         1    0.1%   1209      
##    (0.7, 1]   (bad)        0    0.0%   <NA>      
##    (1, Inf)   (very bad)   0    0.0%   <NA>      
## 
## All Pareto k estimates are ok (k < 0.7).
## See help('pareto-k-diagnostic') for details.
## 
## Model comparisons:
##    elpd_diff se_diff
## m3  0.0       0.0   
## m2 -3.8       5.0   
## m1 -6.7       4.9

\(m_2\)’s score is 0.76 standard errors worse than \(m_3\)’s. This is not a significant difference, so \(m_3\) is not worth its additional complexity (which also makes it harder to interpret). Thus, we stick with \(m_2\) as our selected model.

3.9 Analyzing \(m_3\)

##  Family: MV(gaussian, gaussian) 
##   Links: mu = log; sigma = identity
##          mu = log; sigma = identity 
## Formula: einspectS ~ 0 + family + category + (0 + family | r | category) 
##          timeS ~ 0 + family + category + (0 + family | r | category) 
##    Data: by.statement (Number of observations: 945) 
##   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 4000
## 
## Group-Level Effects: 
## ~category (Number of levels: 4) 
##                                                Estimate Est.Error l-95% CI
## sd(einspectS_familyMBFL)                           0.28      0.15     0.05
## sd(einspectS_familyPS)                             0.28      0.14     0.05
## sd(einspectS_familyST)                             0.31      0.14     0.08
## sd(einspectS_familySBFL)                           0.29      0.15     0.05
## sd(timeS_familyMBFL)                               0.27      0.13     0.05
## sd(timeS_familyPS)                                 0.41      0.15     0.12
## sd(timeS_familyST)                                 0.28      0.15     0.05
## sd(timeS_familySBFL)                               0.30      0.16     0.06
## cor(einspectS_familyMBFL,einspectS_familyPS)      -0.02      0.33    -0.63
## cor(einspectS_familyMBFL,einspectS_familyST)      -0.04      0.33    -0.65
## cor(einspectS_familyPS,einspectS_familyST)         0.01      0.33    -0.62
## cor(einspectS_familyMBFL,einspectS_familySBFL)     0.05      0.33    -0.59
## cor(einspectS_familyPS,einspectS_familySBFL)      -0.01      0.33    -0.64
## cor(einspectS_familyST,einspectS_familySBFL)      -0.04      0.33    -0.64
## cor(einspectS_familyMBFL,timeS_familyMBFL)         0.05      0.33    -0.59
## cor(einspectS_familyPS,timeS_familyMBFL)          -0.01      0.33    -0.64
## cor(einspectS_familyST,timeS_familyMBFL)          -0.11      0.32    -0.69
## cor(einspectS_familySBFL,timeS_familyMBFL)         0.04      0.33    -0.59
## cor(einspectS_familyMBFL,timeS_familyPS)           0.07      0.32    -0.56
## cor(einspectS_familyPS,timeS_familyPS)             0.04      0.31    -0.59
## cor(einspectS_familyST,timeS_familyPS)            -0.20      0.31    -0.74
## cor(einspectS_familySBFL,timeS_familyPS)           0.07      0.33    -0.59
## cor(timeS_familyMBFL,timeS_familyPS)               0.21      0.32    -0.47
## cor(einspectS_familyMBFL,timeS_familyST)           0.04      0.34    -0.59
## cor(einspectS_familyPS,timeS_familyST)            -0.02      0.34    -0.66
## cor(einspectS_familyST,timeS_familyST)            -0.02      0.34    -0.64
## cor(einspectS_familySBFL,timeS_familyST)           0.04      0.33    -0.60
## cor(timeS_familyMBFL,timeS_familyST)               0.03      0.33    -0.60
## cor(timeS_familyPS,timeS_familyST)                 0.02      0.33    -0.61
## cor(einspectS_familyMBFL,timeS_familySBFL)         0.03      0.33    -0.60
## cor(einspectS_familyPS,timeS_familySBFL)           0.04      0.33    -0.59
## cor(einspectS_familyST,timeS_familySBFL)          -0.06      0.33    -0.67
## cor(einspectS_familySBFL,timeS_familySBFL)         0.04      0.33    -0.59
## cor(timeS_familyMBFL,timeS_familySBFL)             0.06      0.33    -0.58
## cor(timeS_familyPS,timeS_familySBFL)               0.09      0.32    -0.56
## cor(timeS_familyST,timeS_familySBFL)               0.02      0.33    -0.62
##                                                u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(einspectS_familyMBFL)                           0.61 1.00     3928     2036
## sd(einspectS_familyPS)                             0.58 1.00     2417     1754
## sd(einspectS_familyST)                             0.60 1.00     2423     2105
## sd(einspectS_familySBFL)                           0.61 1.00     3977     1706
## sd(timeS_familyMBFL)                               0.55 1.00     1492     1845
## sd(timeS_familyPS)                                 0.72 1.00     1746     1509
## sd(timeS_familyST)                                 0.61 1.00     4241     2286
## sd(timeS_familySBFL)                               0.65 1.00     4117     2494
## cor(einspectS_familyMBFL,einspectS_familyPS)       0.61 1.00     4241     2811
## cor(einspectS_familyMBFL,einspectS_familyST)       0.60 1.00     4230     2591
## cor(einspectS_familyPS,einspectS_familyST)         0.62 1.00     3842     3036
## cor(einspectS_familyMBFL,einspectS_familySBFL)     0.66 1.00     4839     2735
## cor(einspectS_familyPS,einspectS_familySBFL)       0.62 1.00     4683     2978
## cor(einspectS_familyST,einspectS_familySBFL)       0.60 1.00     4427     3008
## cor(einspectS_familyMBFL,timeS_familyMBFL)         0.67 1.00     3856     3030
## cor(einspectS_familyPS,timeS_familyMBFL)           0.63 1.00     3837     2863
## cor(einspectS_familyST,timeS_familyMBFL)           0.52 1.00     3216     3076
## cor(einspectS_familySBFL,timeS_familyMBFL)         0.66 1.00     3102     3318
## cor(einspectS_familyMBFL,timeS_familyPS)           0.66 1.00     2999     2592
## cor(einspectS_familyPS,timeS_familyPS)             0.63 1.00     3110     3105
## cor(einspectS_familyST,timeS_familyPS)             0.44 1.00     2478     2674
## cor(einspectS_familySBFL,timeS_familyPS)           0.65 1.00     2859     3036
## cor(timeS_familyMBFL,timeS_familyPS)               0.75 1.00     1552     3466
## cor(einspectS_familyMBFL,timeS_familyST)           0.66 1.00     6251     2864
## cor(einspectS_familyPS,timeS_familyST)             0.61 1.00     4846     2539
## cor(einspectS_familyST,timeS_familyST)             0.61 1.00     4305     2680
## cor(einspectS_familySBFL,timeS_familyST)           0.67 1.00     3225     3055
## cor(timeS_familyMBFL,timeS_familyST)               0.65 1.00     2745     3049
## cor(timeS_familyPS,timeS_familyST)                 0.65 1.00     3142     3447
## cor(einspectS_familyMBFL,timeS_familySBFL)         0.64 1.00     5482     2861
## cor(einspectS_familyPS,timeS_familySBFL)           0.64 1.00     4110     2875
## cor(einspectS_familyST,timeS_familySBFL)           0.59 1.00     3926     3090
## cor(einspectS_familySBFL,timeS_familySBFL)         0.65 1.00     3827     2946
## cor(timeS_familyMBFL,timeS_familySBFL)             0.66 1.00     3139     3044
## cor(timeS_familyPS,timeS_familySBFL)               0.67 1.00     2962     3484
## cor(timeS_familyST,timeS_familySBFL)               0.64 1.00     2192     3228
## 
## Population-Level Effects: 
##                       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
## einspectS_familyMBFL     -2.91      0.54    -4.06    -1.92 1.00     4282
## einspectS_familyPS        0.36      0.27    -0.22     0.91 1.00     1613
## einspectS_familyST        0.21      0.32    -0.49     0.75 1.00     1696
## einspectS_familySBFL     -3.35      0.54    -4.46    -2.32 1.00     3821
## einspectS_categoryDEV    -2.39      0.55    -3.56    -1.39 1.00     3427
## einspectS_categoryDS     -0.56      0.33    -1.25     0.07 1.00     1268
## einspectS_categoryWEB    -2.19      0.54    -3.30    -1.22 1.00     3829
## timeS_familyMBFL         -0.28      0.31    -0.80     0.39 1.00      960
## timeS_familyPS           -1.01      0.37    -1.71    -0.23 1.00     1226
## timeS_familyST           -4.29      0.51    -5.32    -3.30 1.00     2830
## timeS_familySBFL         -3.85      0.53    -4.91    -2.83 1.00     3641
## timeS_categoryDEV         0.23      0.39    -0.64     0.90 1.00      962
## timeS_categoryDS          0.50      0.39    -0.39     1.15 1.00     1036
## timeS_categoryWEB        -0.09      0.41    -1.01     0.59 1.00     1261
##                       Tail_ESS
## einspectS_familyMBFL      2990
## einspectS_familyPS        1867
## einspectS_familyST        2585
## einspectS_familySBFL      2637
## einspectS_categoryDEV     2776
## einspectS_categoryDS      2440
## einspectS_categoryWEB     2990
## timeS_familyMBFL          1521
## timeS_familyPS            1720
## timeS_familyST            2771
## timeS_familySBFL          2705
## timeS_categoryDEV         1728
## timeS_categoryDS          1608
## timeS_categoryWEB         1828
## 
## Family Specific Parameters: 
##                 Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma_einspectS     0.85      0.02     0.81     0.89 1.00     7246     2646
## sigma_timeS         0.84      0.02     0.81     0.88 1.00     6615     2828
## 
## Residual Correlations: 
##                         Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
## rescor(einspectS,timeS)     0.08      0.04     0.01     0.16 1.00     6008
##                         Tail_ESS
## rescor(einspectS,timeS)     3210
## 
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

Instead of considering the fixed and varying effects of \(m_3\), we may estimate the marginal means for each family of FL techniques (results omitted for brevity, since we’ll focus on \(m_2\) anyway).

3.10 Model \(m_4\): bug-kind-specific interaction effects

Let’s now add predictors to \(m_2\), so as to study any effect of the kinds of bugs:

  • predicate is a Boolean value that identifies predicate-related bugs
  • crashing is a Boolean value that identifies crashing bugs
  • mutability is a nonnegative score that denotes the percentage of mutants that mutate a line in a bug’s ground truth
  • ismutable is a Boolean that identifies the bugs with a positive mutability score

Since mutability (and hence ismutable) likely affects both category and einspect, it makes sense to add it as a predictor, so as to close the possible backdoor path \(\textrm{category} \leftarrow \textrm{ismutable} \rightarrow \textrm{einspect}\).
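As a rough illustration of why closing a backdoor path matters, the following sketch simulates hypothetical data (not the study's data) in which a binary flag is a common cause of both a category indicator and the outcome, while the category itself has no effect. All variable names and effect sizes are made up for the illustration, and plain `lm` stands in for the Bayesian model.

```r
# Hypothetical simulation: `mutable` is a common cause of `is_dev` and `outcome`.
set.seed(1)
n <- 5000
mutable <- rbinom(n, 1, 0.5)
# Category membership depends on mutable (the arrow category <- mutable).
is_dev <- rbinom(n, 1, 0.2 + 0.6 * mutable)
# The outcome depends on mutable only (mutable -> outcome); is_dev has no effect.
outcome <- 1 + 2 * mutable + rnorm(n)

# Without the confounder, the category picks up a spurious association.
unadjusted <- unname(coef(lm(outcome ~ is_dev))["is_dev"])
# Conditioning on the confounder closes the backdoor path.
adjusted <- unname(coef(lm(outcome ~ is_dev + mutable))["is_dev"])

print(round(c(unadjusted = unadjusted, adjusted = adjusted), 2))
```

In this toy setting the unadjusted coefficient is clearly positive although the category has no causal effect, whereas the adjusted coefficient is close to zero.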

We are only interested in controlling for bug kind for einspect, so we switch to a univariate model where einspect is the only outcome variable.

eq.m4.einspect <- brmsformula(einspectS ~ 1 
                              + (1|p|family) + (1|q|category) 
                              + predicate*family 
                              + crashing*family 
                              + ismutable*family,
                              family=brmsfamily("gaussian", link="log"))

eq.m4 <- eq.m4.einspect

pp4.check <- get_prior(eq.m4, data=by.statement)

pp4 <- c(
  set_prior("normal(0, 1.0)", class="Intercept"),
  set_prior("normal(0, 1.0)", class="b"),
  set_prior("weibull(2, 0.3)", class="sd", coef="Intercept", 
            group="family"),
  set_prior("weibull(2, 0.3)", class="sd", coef="Intercept", 
            group="category"),
  set_prior("gamma(0.01, 0.01)", class="sigma")
)
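To get a feel for what these priors imply, the sketch below computes a few quantiles in base R. It assumes that Stan's `weibull(2, 0.3)` is parameterized by shape and scale in the same way as R's `qweibull`; the variable names are only for illustration.

```r
# Prior on the group-level standard deviations: weibull(shape = 2, scale = 0.3).
sd_prior_median <- qweibull(0.5, shape = 2, scale = 0.3)
sd_prior_95 <- qweibull(c(0.025, 0.975), shape = 2, scale = 0.3)

# Prior on population-level coefficients: normal(0, 1) on the log scale.
# With a log link, exp(coefficient) is a multiplicative factor on the mean.
b_prior_95 <- qnorm(c(0.025, 0.975), mean = 0, sd = 1)
factor_prior_95 <- exp(b_prior_95)

print(round(sd_prior_median, 3))
print(round(sd_prior_95, 3))
print(round(factor_prior_95, 2))
```

Under these assumptions, the prior keeps the group-level standard deviations well below 1, while each coefficient is allowed to scale the mean by roughly a factor of seven in either direction.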

3.11 Fitting \(m_4\)

Prior checks:

We fit model \(m_4\).

## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
## 
## Chain 4 finished in 25.8 seconds.
## Chain 2 finished in 26.9 seconds.
## Chain 3 finished in 28.1 seconds.
## Chain 1 finished in 34.8 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 28.9 seconds.
## Total execution time: 34.9 seconds.

Diagnostics:

## [1] 0
## [1] 1.003327
## [1] 0.3383162

Posterior checks:

Since \(m_4\) uses less data than the previous models (it doesn’t consider outcome time), we cannot compare it to the other models using LOO (or any information criterion, for that matter).

3.12 Analyzing \(m_4\)

##  Family: gaussian 
##   Links: mu = log; sigma = identity 
## Formula: einspectS ~ 1 + (1 | p | family) + (1 | q | category) + predicate * family + crashing * family + ismutable * family 
##    Data: by.statement (Number of observations: 945) 
##   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 4000
## 
## Group-Level Effects: 
## ~category (Number of levels: 4) 
##               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)     0.70      0.12     0.48     0.95 1.00     4044     3090
## 
## ~family (Number of levels: 4) 
##               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)     0.46      0.20     0.10     0.85 1.00     1799     1862
## 
## Population-Level Effects: 
##                          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
## Intercept                   -2.67      0.65    -3.88    -1.36 1.00     2428
## predicateTRUE               -0.46      0.50    -1.45     0.50 1.00     2012
## familyPS                     1.62      0.68     0.22     2.87 1.00     2388
## familyST                     1.89      0.69     0.38     3.12 1.00     2314
## familySBFL                  -1.20      0.70    -2.57     0.17 1.00     4081
## crashingTRUE                -1.15      0.52    -2.18    -0.15 1.00     2799
## ismutableTRUE               -0.25      0.48    -1.16     0.69 1.00     1965
## predicateTRUE:familyPS      -0.07      0.52    -1.06     0.94 1.00     2010
## predicateTRUE:familyST       0.48      0.50    -0.49     1.48 1.00     1980
## predicateTRUE:familySBFL    -0.02      0.86    -1.73     1.66 1.00     5036
## familyPS:crashingTRUE        1.01      0.54    -0.02     2.06 1.00     2822
## familyST:crashingTRUE       -1.82      0.65    -3.12    -0.57 1.00     3334
## familySBFL:crashingTRUE      0.03      0.87    -1.67     1.67 1.00     4750
## familyPS:ismutableTRUE       1.01      0.49     0.02     1.96 1.00     2148
## familyST:ismutableTRUE       0.87      0.49    -0.12     1.81 1.00     1914
## familySBFL:ismutableTRUE    -0.29      0.81    -1.84     1.31 1.00     4679
##                          Tail_ESS
## Intercept                    2629
## predicateTRUE                2368
## familyPS                     2632
## familyST                     2882
## familySBFL                   2796
## crashingTRUE                 2913
## ismutableTRUE                2223
## predicateTRUE:familyPS       2540
## predicateTRUE:familyST       2460
## predicateTRUE:familySBFL     2962
## familyPS:crashingTRUE        2868
## familyST:crashingTRUE        2923
## familySBFL:crashingTRUE      2746
## familyPS:ismutableTRUE       2617
## familyST:ismutableTRUE       2329
## familySBFL:ismutableTRUE     2819
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     0.78      0.02     0.75     0.82 1.00     6433     2648
## 
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

Let’s now perform an effects analysis on the fitted coefficients of \(m_4\).

Specifically, we look at the (fixed) effects of the families associated with certain categories of bugs, for response einspect.

## $ints
##       crashing MBFL crashing PS crashing ST crashing SBFL
## |0.5     -1.4957225  0.64172875  -2.2535900    -0.5706570
## |0.7     -1.6966420  0.45311365  -2.5091630    -0.8849482
## |0.9     -2.0240975  0.14002575  -2.8906505    -1.4130940
## |0.95    -2.1821472 -0.01729245  -3.1170247    -1.6737832
## |0.99    -2.5282891 -0.35063686  -3.5874488    -2.1767845
## 0.99|     0.1074132  2.44609395  -0.2224062     2.1764674
## 0.95|    -0.1510499  2.05582775  -0.5708796     1.6714243
## 0.9|     -0.3032274  1.89343600  -0.7896973     1.4436930
## 0.7|     -0.6125321  1.57059350  -1.1579595     0.9424029
## 0.5|     -0.7958315  1.37090000  -1.3684550     0.6420823
## 
## $est
## crashing MBFL   crashing PS   crashing ST crashing SBFL 
##   -1.15095905    1.00587635   -1.82331692    0.02942783
## $ints
##       predicate MBFL predicate PS predicate ST predicate SBFL
## |0.5     -0.80539225   -0.4252112   0.14267150     -0.5879013
## |0.7     -0.97924005   -0.5941629  -0.04137106     -0.8908687
## |0.9     -1.27643350   -0.8829843  -0.30878825     -1.4396120
## |0.95    -1.44913900   -1.0634027  -0.49199142     -1.7304670
## |0.99    -1.85374910   -1.3837009  -0.76560686     -2.2351072
## 0.99|     0.81469355    1.3561932   1.89115115      2.0862540
## 0.95|     0.50288860    0.9428116   1.47530875      1.6589353
## 0.9|      0.33490690    0.7920933   1.31585700      1.3988270
## 0.7|      0.05072069    0.4752352   1.00939850      0.8711154
## 0.5|     -0.12628675    0.2727390   0.81871150      0.5510417
## 
## $est
## predicate MBFL   predicate PS   predicate ST predicate SBFL 
##    -0.46324267    -0.07029723     0.48350202    -0.02474266
## $ints
##       ismutable MBFL ismutable PS ismutable ST ismutable SBFL
## |0.5     -0.56463550   0.67018100   0.54703600     -0.8342113
## |0.7     -0.74068020   0.50529050   0.35888690     -1.1446125
## |0.9     -1.02027550   0.20357500   0.05512335     -1.6622670
## |0.95    -1.16404750   0.01718417  -0.12405297     -1.8383202
## |0.99    -1.46524645  -0.30899653  -0.47210512     -2.3507601
## 0.99|     1.03952155   2.31925060   2.11852570      1.6988942
## 0.95|     0.69069932   1.95853000   1.81475750      1.3090160
## 0.9|      0.52912870   1.82134500   1.66444000      1.0380830
## 0.7|      0.24578715   1.51742650   1.36737800      0.5435591
## 0.5|      0.07147647   1.33092500   1.19420250      0.2685713
## 
## $est
## ismutable MBFL   ismutable PS   ismutable ST ismutable SBFL 
##     -0.2487351      1.0068292      0.8698194     -0.2899824

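So, crashing bugs are indeed easier for ST: the crashing ST effect is estimated at -1.82, and its 95% interval (-3.12, -0.57) lies entirely below zero. In contrast, predicate-related bugs do not seem to be simpler for PS: the predicate PS estimate is -0.07, and even its 50% interval (-0.43, 0.27) includes zero.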
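The 95% rows of the `$ints` tables agree with the l-95% CI and u-95% CI columns of the model summary, which suggests that the tables contain equal-tailed quantile intervals of the posterior draws. Here is a minimal sketch of how such a table could be computed; the helper `quantile_intervals` and the simulated draws are hypothetical, not the function or data used for the output above.

```r
# Equal-tailed posterior intervals at several probability levels,
# laid out like the $ints tables: lower endpoints first ("|p"),
# then upper endpoints in reverse order ("p|").
quantile_intervals <- function(draws, probs = c(0.5, 0.7, 0.9, 0.95, 0.99)) {
  lower <- sapply(probs, function(p) quantile(draws, (1 - p) / 2, names = FALSE))
  upper <- sapply(rev(probs), function(p) quantile(draws, 1 - (1 - p) / 2, names = FALSE))
  ints <- c(lower, upper)
  names(ints) <- c(paste0("|", probs), paste0(rev(probs), "|"))
  # Assumption: the point estimate is the posterior mean.
  list(ints = ints, est = mean(draws))
}

# Hypothetical draws standing in for one coefficient's posterior sample.
set.seed(42)
fake_draws <- rnorm(4000, mean = -1.8, sd = 0.65)
res <- quantile_intervals(fake_draws)
print(res)
```

An effect is then read off by checking which of the nested intervals lie entirely on one side of zero.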

For mutable bugs, we don’t find any consistent association. Thus, let’s try to add to the model a finer-grained dependency on mutability rather than just the Boolean indicator ismutable.

3.13 Model \(m_5\): mutability slope (failed attempt)

A simple way would be to introduce an interaction mutability\(\times\)family.

eq.m5.einspect <- brmsformula(einspectS ~ 1 
                              + (1|p|family) + (1|q|category) 
                              + predicate*family 
                              + crashing*family 
                              + mutability*family,
                              family=brmsfamily("gaussian", link="log"))

eq.m5 <- eq.m5.einspect

pp5.check <- get_prior(eq.m5, data=by.statement)

pp5 <- c(
  set_prior("normal(0, 1.0)", class="Intercept"),
  set_prior("normal(0, 1.0)", class="b"),
  set_prior("weibull(2, 0.3)", class="sd", coef="Intercept", 
            group="family"),
  set_prior("weibull(2, 0.3)", class="sd", coef="Intercept", 
            group="category"),
  set_prior("gamma(0.01, 0.01)", class="sigma")
)

We could get passable (not great) prior checks, but let’s cut to the chase and fit model \(m_5\).

## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
## 
## Chain 1 finished in 3.6 seconds.
## Chain 3 finished in 3.5 seconds.
## Chain 4 finished in 96.5 seconds.
## Chain 2 finished in 114.8 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 54.6 seconds.
## Total execution time: 115.2 seconds.
## Warning: 438 of 4000 (11.0%) transitions hit the maximum treedepth limit of 10.
## See https://mc-stan.org/misc/warnings for details.
## Warning: 2 of 4 chains have a NaN E-BFMI.
## See https://mc-stan.org/misc/warnings for details.

The first thing that we notice is that two of the four chains terminated suspiciously quickly, whereas the other two went awry and spun for much longer. In addition, we got a number of scary warnings. This points to some region of the posterior that could not be sampled effectively.

Let’s see the diagnostics:

## [1] 0
## [1] 5.472564
## [1] 0.001008065

A disaster. Let’s also plot the trace plots.

Two chains are straight lines, and hence did not mix at all with the others!
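Chains that do not mix inflate Rhat, which is presumably what the value 5.47 among the diagnostics above reports. The sketch below implements the basic Gelman-Rubin statistic in base R on simulated chains. It is a simplified stand-in: the rank-normalized split-Rhat used by current Stan tooling is more refined, and the function name `basic_rhat` and the simulated chains are hypothetical.

```r
# Basic Gelman-Rubin potential scale reduction factor for one parameter.
# `chains` is a matrix with one column per chain and one row per iteration.
basic_rhat <- function(chains) {
  n <- nrow(chains)
  within <- mean(apply(chains, 2, var))   # average within-chain variance
  between <- n * var(colMeans(chains))    # between-chain variance
  pooled <- (n - 1) / n * within + between / n
  sqrt(pooled / within)
}

set.seed(7)
# Four chains exploring the same distribution: Rhat close to 1.
mixed <- matrix(rnorm(4000), ncol = 4)
# Two healthy chains plus two chains stuck near different values.
stuck <- cbind(rnorm(1000), rnorm(1000),
               3 + rnorm(1000, sd = 0.001), -3 + rnorm(1000, sd = 0.001))

print(c(mixed = basic_rhat(mixed), stuck = basic_rhat(stuck)))
```

Stuck chains contribute almost no within-chain variance but a lot of between-chain variance, so the ratio moves far away from 1.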

Notice that the distribution of mutability is very skewed, which likely explains the difficulties in fitting \(m_5\).

3.14 Model \(m_6\): mutability slope (successful attempt)

The most straightforward way out of this ditch is to simply log-transform mutability (after adding 1 to all percentages so that all logs are defined).

by.statement$logmutability <- log(1 + by.statement$mutability)

eq.m6.einspect <- brmsformula(einspectS ~ 1 
                              + (1|p|family) + (1|q|category) 
                              + predicate*family 
                              + crashing*family 
                              + logmutability*family,
                              family=brmsfamily("gaussian", link="log"))

eq.m6 <- eq.m6.einspect

pp6.check <- get_prior(eq.m6, data=by.statement)

pp6 <- c(
  set_prior("normal(0, 1.0)", class="Intercept"),
  set_prior("normal(0, 1.0)", class="b"),
  set_prior("weibull(2, 0.3)", class="sd", coef="Intercept", 
            group="family"),
  set_prior("weibull(2, 0.3)", class="sd", coef="Intercept", 
            group="category"),
  set_prior("gamma(0.01, 0.01)", class="sigma")
)
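To see what the log transform behind logmutability does to a skewed predictor, here is a small base-R sketch on simulated scores. The data are hypothetical (a block of exact zeros plus a long right tail) and only mimic the qualitative shape described above.

```r
# Sample skewness (third standardized moment).
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

set.seed(3)
# Hypothetical right-skewed, nonnegative scores with many exact zeros.
fake_mutability <- c(rep(0, 300), rlnorm(700, meanlog = 0, sdlog = 1.5))

# log(1 + x) is defined at zero and compresses the long right tail.
fake_log <- log(1 + fake_mutability)

print(round(c(raw = skewness(fake_mutability), logged = skewness(fake_log)), 2))
```

The transformed scores are much less skewed, which tends to make the slope easier for the sampler to explore. In R, `log1p(x)` computes the same quantity as `log(1 + x)`.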

Alternative ways to modify \(m_5\) so that it can be analyzed (which we mention but don’t further explore here):

  • Introducing a multi-level term, with einspect ~ log(x)*family, and log(x) = log(y) + a, where \(x/y = \textrm{mutability}\). This is based on rewriting \(\log(x/y) = a\) into \(\log(x) = a + \log(y)\).

  • The approach followed in this paper.

3.15 Fitting \(m_6\)

Prior checks:

We fit model \(m_6\).

## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
## 
## Chain 3 finished in 34.1 seconds.
## Chain 2 finished in 35.4 seconds.
## Chain 4 finished in 35.9 seconds.
## Chain 1 finished in 37.2 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 35.6 seconds.
## Total execution time: 37.3 seconds.

Diagnostics:

## [1] 0
## [1] 1.003259
## [1] 0.4447736

Posterior checks:

Everything is A-OK now.

Let’s compare the models \(m_4\) and \(m_6\) using LOO.

## Output of model 'm4':
## 
## Computed from 4000 by 945 log-likelihood matrix
## 
##          Estimate    SE
## elpd_loo  -1143.7  57.6
## p_loo        62.6   9.7
## looic      2287.5 115.2
## ------
## Monte Carlo SE of elpd_loo is NA.
## 
## Pareto k diagnostic values:
##                          Count Pct.    Min. n_eff
## (-Inf, 0.5]   (good)     938   99.3%   286       
##  (0.5, 0.7]   (ok)         6    0.6%   155       
##    (0.7, 1]   (bad)        1    0.1%   45        
##    (1, Inf)   (very bad)   0    0.0%   <NA>      
## See help('pareto-k-diagnostic') for details.
## 
## Output of model 'm6':
## 
## Computed from 4000 by 945 log-likelihood matrix
## 
##          Estimate    SE
## elpd_loo  -1155.0  62.4
## p_loo        52.5   8.7
## looic      2310.0 124.8
## ------
## Monte Carlo SE of elpd_loo is 0.2.
## 
## Pareto k diagnostic values:
##                          Count Pct.    Min. n_eff
## (-Inf, 0.5]   (good)     940   99.5%   404       
##  (0.5, 0.7]   (ok)         5    0.5%   118       
##    (0.7, 1]   (bad)        0    0.0%   <NA>      
##    (1, Inf)   (very bad)   0    0.0%   <NA>      
## 
## All Pareto k estimates are ok (k < 0.7).
## See help('pareto-k-diagnostic') for details.
## 
## Model comparisons:
##    elpd_diff se_diff
## m4   0.0       0.0  
## m6 -11.3      18.3

\(m_6\) and \(m_4\) are very close in terms of predictive capabilities.
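The comparison can be checked with a little arithmetic on the numbers printed above. The sketch below only re-derives quantities from the reported values; the closing remark about two standard errors is a common rule of thumb, not a formal test.

```r
# Values copied from the loo output above.
elpd_m4 <- -1143.7
elpd_m6 <- -1155.0
elpd_diff <- -11.3
se_diff <- 18.3

# The information criterion is the elpd on the deviance scale.
looic_m4 <- -2 * elpd_m4
looic_m6 <- -2 * elpd_m6

# Difference in units of its standard error.
z <- abs(elpd_diff) / se_diff

print(c(looic_m4 = looic_m4, looic_m6 = looic_m6))
print(round(z, 2))
```

The difference is well under one standard error, so by the usual heuristic (differences within about two standard errors are indistinguishable) neither model is clearly better. The small mismatch between the recomputed and the reported looic of \(m_4\) comes from rounding in the printed elpd.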

3.16 Analyzing \(m_6\)

##  Family: gaussian 
##   Links: mu = log; sigma = identity 
## Formula: einspectS ~ 1 + (1 | p | family) + (1 | q | category) + predicate * family + crashing * family + logmutability * family 
##    Data: by.statement (Number of observations: 945) 
##   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 4000
## 
## Group-Level Effects: 
## ~category (Number of levels: 4) 
##               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)     0.71      0.12     0.49     0.97 1.00     3682     3030
## 
## ~family (Number of levels: 4) 
##               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)     0.48      0.18     0.14     0.84 1.00     2042     2078
## 
## Population-Level Effects: 
##                          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
## Intercept                   -2.39      0.63    -3.63    -1.16 1.00     2389
## predicateTRUE               -0.26      0.49    -1.22     0.69 1.00     2375
## familyPS                     1.60      0.67     0.17     2.80 1.00     2430
## familyST                     1.90      0.66     0.48     3.12 1.00     2150
## familySBFL                  -1.28      0.71    -2.67     0.12 1.00     3825
## crashingTRUE                -1.19      0.53    -2.25    -0.19 1.00     2228
## logmutability               -0.53      0.35    -1.28     0.11 1.00     1944
## predicateTRUE:familyPS       0.06      0.50    -0.94     1.03 1.00     2371
## predicateTRUE:familyST       0.58      0.50    -0.38     1.55 1.00     2406
## predicateTRUE:familySBFL    -0.08      0.85    -1.78     1.54 1.00     4831
## familyPS:crashingTRUE        1.06      0.54     0.05     2.15 1.00     2270
## familyST:crashingTRUE       -1.79      0.68    -3.16    -0.49 1.00     3195
## familySBFL:crashingTRUE      0.01      0.84    -1.68     1.61 1.00     4947
## familyPS:logmutability       0.63      0.35    -0.00     1.36 1.00     1947
## familyST:logmutability       0.52      0.35    -0.13     1.26 1.00     1954
## familySBFL:logmutability    -0.11      0.63    -1.46     0.99 1.00     3366
##                          Tail_ESS
## Intercept                    2811
## predicateTRUE                2740
## familyPS                     2606
## familyST                     2775
## familySBFL                   2990
## crashingTRUE                 2284
## logmutability                2136
## predicateTRUE:familyPS       2542
## predicateTRUE:familyST       2562
## predicateTRUE:familySBFL     3103
## familyPS:crashingTRUE        2604
## familyST:crashingTRUE        2819
## familySBFL:crashingTRUE      3009
## familyPS:logmutability       2147
## familyST:logmutability       2130
## familySBFL:logmutability     2465
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     0.80      0.02     0.76     0.83 1.00     4795     3099
## 
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
## $ints
##       logmutability MBFL logmutability PS logmutability ST logmutability SBFL
## |0.5         -0.75474450      0.375940500      0.266247750         -0.5156230
## |0.7         -0.88416690      0.254462550      0.147502000         -0.7639471
## |0.8         -0.97204130      0.182238400      0.078897560         -0.9411668
## |0.87        -1.05968925      0.115193315      0.008621432         -1.1116126
## |0.9         -1.12050650      0.085546755     -0.027511595         -1.2257940
## |0.95        -1.28001225     -0.004312041     -0.130285675         -1.4553477
## |0.99        -1.47134080     -0.178970675     -0.261238665         -1.8345544
## 0.99|         0.25625661      1.564178850      1.469844800          1.3301171
## 0.95|         0.10875320      1.362056500      1.258613250          0.9929937
## 0.9|          0.01516966      1.221132000      1.107307500          0.8401534
## 0.87|        -0.02294809      1.163467750      1.050067700          0.7640344
## 0.8|         -0.09323496      1.068632000      0.955611000          0.6521943
## 0.7|         -0.16356845      0.982931800      0.874557150          0.5276751
## 0.5|         -0.27993625      0.855096250      0.749353000          0.3318090
## 
## $est
## logmutability MBFL   logmutability PS   logmutability ST logmutability SBFL 
##         -0.5291964          0.6255506          0.5170813         -0.1092753
## [1] TRUE

There is a weak tendency for MBFL to do better on mutable bugs, but it can only be detected with 87% probability (which is still decent): the 87% interval for logmutability MBFL lies entirely below zero, whereas the 90% interval does not. Incidentally, PS (and, to a lesser degree, ST) tend to perform worse on the same kinds of bugs, whereas SBFL is agnostic.
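The 87% figure can be read directly off the `$ints` table for logmutability MBFL. The following sketch copies the endpoints from that table and looks for the widest interval that still excludes zero; the variable names are only for illustration.

```r
# Interval endpoints for "logmutability MBFL", copied from the $ints table above.
prob_levels <- c(0.5, 0.7, 0.8, 0.87, 0.9, 0.95, 0.99)
lower <- c(-0.75474450, -0.88416690, -0.97204130, -1.05968925,
           -1.12050650, -1.28001225, -1.47134080)
upper <- c(-0.27993625, -0.16356845, -0.09323496, -0.02294809,
           0.01516966, 0.10875320, 0.25625661)

# An interval supports a negative effect when it lies entirely below zero.
excludes_zero <- lower < 0 & upper < 0
widest_level <- max(prob_levels[excludes_zero])
print(widest_level)
```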

Finally, let’s also collect the varying-intercept estimates and intervals for the group-level terms family and category. In \(m_6\), these now correspond to the effects on bugs that are in none of the special categories (crashing, predicate, mutable); since this is a relatively small set, we don’t expect any very strong tendency (simply because the data is limited).

## $ints
##              MBFL          PS           ST        SBFL
## |0.5  -1.09837750 -0.03866348  0.008106222 -1.03776500
## |0.7  -1.31961950 -0.16916845 -0.123050900 -1.32181800
## |0.9  -1.79294900 -0.45615560 -0.354203300 -1.84372150
## |0.95 -2.05672725 -0.59820110 -0.500310250 -2.13864925
## |0.99 -2.59430055 -0.91827276 -0.792044045 -2.81509720
## 0.99|  0.31546326  1.81253300  1.835731450  0.50949495
## 0.95|  0.12512723  1.29965575  1.372258250  0.24451078
## 0.9|   0.03854548  1.09694800  1.147540000  0.13960190
## 0.7|  -0.16430215  0.72556065  0.774333900 -0.05992452
## 0.5|  -0.31820900  0.52408350  0.571569000 -0.19567100
## 
## $est
##       MBFL         PS         ST       SBFL 
## -0.7456581  0.2577865  0.3136117 -0.6669348
## $ints
##                CL        DEV         DS        WEB
## |0.5   0.60921450 -1.9213750  0.2900920 -1.8046375
## |0.7   0.47578020 -2.1417310  0.1576821 -2.0088575
## |0.9   0.21938015 -2.5400950 -0.1101333 -2.3996800
## |0.95  0.02653933 -2.7496218 -0.2652119 -2.6116985
## |0.99 -0.29597590 -3.1878640 -0.6054692 -2.9940812
## 0.99|  1.81451660 -0.4967978  1.4707381 -0.2732759
## 0.95|  1.56997425 -0.7112183  1.2597173 -0.5808576
## 0.9|   1.45210450 -0.8386855  1.1383780 -0.7016571
## 0.7|   1.23746050 -1.1045085  0.9314273 -0.9619804
## 0.5|   1.11337000 -1.2482250  0.7968602 -1.1282750
## 
## $est
##         CL        DEV         DS        WEB 
##  0.8508594 -1.6101808  0.5346215 -1.4848656

4 Summary plots

Let’s prepare and print some plots of the overall results for model \(m_6\).

To read the plots more easily, let’s also convert the various interval endpoints to the outcome scale, both in standardized units and in absolute units.
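Because \(m_6\) uses a log link, coefficients and interval endpoints live on the log scale, and exponentiating them gives multiplicative factors on the modeled mean of einspectS. The sketch below applies this to one coefficient copied from the \(m_6\) summary. The further step to absolute units depends on how einspectS was standardized, which is not shown in this part of the report, so it is left out.

```r
# 95% interval and estimate for familyST:crashingTRUE in m6,
# copied from the summary above (log scale, since the model uses a log link).
log_scale <- c(lower = -3.16, estimate = -1.79, upper = -0.49)

# On the outcome scale the coefficient acts as a multiplicative factor
# on the modeled mean, holding the other terms fixed.
factor_scale <- exp(log_scale)
print(round(factor_scale, 3))
```

A factor below 1 corresponds to a lower expected einspectS, that is, to better fault localization.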

## [1] TRUE
## [1] TRUE
## [1] TRUE

5 Dump all data and plots

## Saving 7 x 5 in image
## [1] "paper/m2-family.pdf"
## Saving 7 x 5 in image
## [1] "paper/m2-category.pdf"
## Saving 7 x 5 in image
## [1] "paper/m6-crashing.pdf"
## Saving 7 x 5 in image
## [1] "paper/m6-predicate.pdf"
## Saving 7 x 5 in image
## [1] "paper/m6-mutable.pdf"